Flexible failover policies in high availability computing systems

ABSTRACT

A system for implementing a failover policy includes a cluster infrastructure for managing a plurality of nodes, a high availability infrastructure for providing group and cluster membership services, and a high availability script execution component operative to receive a failover script and at least one failover attribute and operative to produce a failover domain. In addition, a method for determining a target node for a failover comprises executing a failover script that produces a failover domain, the failover domain having an ordered list of nodes, receiving a failover attribute and based on the failover attribute and failover domain, selecting a node upon which to locate a resource.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation and claims the priority benefit of U.S. patent application Ser. No. 14/288,079 filed May 27, 2014, issuing as U.S. Pat. No. 9,405,640, which is a continuation and claims the priority benefit of U.S. patent application Ser. No. 12/891,390 filed Sep. 27, 2010, now U.S. Pat. No. 8,769,132, which is a continuation and claims the priority benefit of U.S. patent application Ser. No. 09/997,404 filed Nov. 29, 2001, which is a continuation and claims the priority benefit of U.S. patent application Ser. No. 09/811,357 filed Mar. 16, 2001, which claims the priority benefit of U.S. provisional application 60/189,864 filed Mar. 16, 2000, the disclosures of which are incorporated herein by reference.

COPYRIGHT NOTICE/PERMISSION

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings hereto: Copyright © 2000, 2001 Silicon Graphics Incorporated, All Rights Reserved.

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention is related to computer processing, and more particularly to providing flexible failover policies on high availability computer processing systems.

Description of the Related Art

Companies today rely on computers to drive all aspects of their business. Certain business functions can survive intermittent interruptions in service; others cannot.

To date, attempts to ensure high availability to mission critical applications have relied on two approaches. Applications have been made more available either through the use of specialized fault tolerant hardware or through cumbersome changes to the applications or to the environment in which the applications run. These approaches increase the costs to the organization of running the applications. In addition, certain approaches to making applications more available increase the risk of introducing errors in the underlying data.

What is needed is a system and method of increasing the availability of mission critical applications by providing greater failover flexibility in determining the targets for moving resources from a machine that has failed.

SUMMARY OF THE PRESENTLY CLAIMED INVENTION

To address the problems stated above, and to solve other problems that will become apparent in reading the specification and claims, a high availability computing system and method are described. The high availability computing system includes a plurality of servers connected by a first and a second network, wherein the servers communicate with each other to detect server failure and transfer applications to other servers on detecting server failure through a process referred to as “failover”.

According to another aspect of the present invention, a system for implementing a failover policy includes a cluster infrastructure for managing a plurality of nodes, a high availability infrastructure for providing group and cluster membership services, and a high availability script execution component operative to receive a failover script and at least one failover attribute and operative to produce a failover domain.

According to another aspect of the invention, a method for determining a target node for a failover comprises executing a failover script that produces a failover domain, the failover domain having an ordered list of nodes, receiving a failover attribute and based on the failover attribute and failover domain, selecting a node upon which to locate a resource.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A shows a diagram of the hardware and operating environment in conjunction with which embodiments of the invention may be practiced;

FIG. 1B is a diagram illustrating an exemplary node configuration according to embodiments of the invention; and

FIG. 2 is a flowchart illustrating a method for providing failover policies according to an embodiment of the invention.

DETAILED DESCRIPTION

In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar computing device, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

DEFINITIONS

A number of computing terms will be used throughout this specification. In this specification, a cluster node is a single computer system. Usually, a cluster node is an individual computer. The term node is also used for brevity. When one node fails, other nodes are left intact and able to operate.

A pool is the entire set of nodes involved with a group of clusters. The group of clusters is usually close together and should always serve a common purpose. A replicated database is stored on each node in the pool.

A cluster is a collection of one or more nodes coupled to each other by networks or other similar interconnections. A cluster is identified by a simple name; this name must be unique within the pool. A particular node may be a member of only one cluster. All nodes in a cluster are also in the pool; however, not all nodes in the pool are necessarily in the cluster.

A node membership is the list of nodes in a cluster on which High Availability base software can allocate resource groups.

A process membership is the list of process instances in a cluster that form a process group. There can be multiple process groups per node.

A client-server environment is one in which a set of users operate on a set of client systems connected through a network to a set of server systems. Often, applications within a client-server system are divided into two components: a client component and a server component. Each component can run on the same or different nodes. A process running the client component of the application is called a client; a process running the server component is called a server.

Clients send requests to servers and collect responses from them. Not all servers can satisfy all requests. For instance, a class of Oracle database servers might be able to satisfy requests regarding the employees of a company, while another class might be able to satisfy requests regarding the company's products.

Servers that are able to satisfy the same type of requests are said to be providing the same service. The time interval between the event of posting a request and the event of receiving a response is called latency.

Service availability can be defined by the following example. Consider a web service implemented by a set of web servers running on a single system. Assume that the system suffers an operating system failure. After the system is rebooted, the web servers are restarted and clients can connect again. A failure of the servers therefore appears to clients like a long latency.

A service is said to be unavailable to a client when latencies become greater than a certain threshold, called critical latency. Otherwise, it is available. A service is down when it is unavailable to all clients; otherwise, it is up. An outage occurs when a service goes down. The outage lasts until the service comes up again. If downtime is the sum of the durations of outages over a certain time interval D = [t, t′] for a certain service S, service availability can be defined as:

avail(S) = 1 - downtime/(t′ - t)

where t′ - t is a large time interval, generally a year. For instance, a service that is 99.99% available should have a yearly downtime of about an hour (0.01% of 8,760 hours is roughly 53 minutes). A service that is available 99.99% or higher is generally called highly available.
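The availability formula above can be checked numerically. The following is a minimal sketch, assuming a 365-day year as the interval t′ - t; the function names and the HOURS_PER_YEAR constant are illustrative and do not appear in the specification.

```python
# Illustrative sketch of avail(S) = 1 - downtime / (t' - t).
# Assumption: the interval t' - t is one 365-day year, measured in hours.
HOURS_PER_YEAR = 365 * 24  # 8760 hours


def availability(downtime_hours, interval_hours=HOURS_PER_YEAR):
    """Return service availability as a fraction between 0 and 1."""
    if interval_hours <= 0:
        raise ValueError("interval must be positive")
    return 1.0 - downtime_hours / interval_hours


def allowed_downtime_hours(avail, interval_hours=HOURS_PER_YEAR):
    """Return the downtime budget implied by a target availability."""
    return (1.0 - avail) * interval_hours


def is_highly_available(avail, threshold=0.9999):
    """Apply the 99.99% threshold the text uses for 'highly available'."""
    return avail >= threshold
```

Under these assumptions, a 99.99% target leaves a budget of 0.876 hours per year, which matches the "about an hour" figure in the text.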

Service outages occur for two reasons: maintenance (e.g. hardware and software upgrades) and failures (e.g. hardware failures, OS crashes).

Outages due to maintenance are generally considered less severe. They can be scheduled when clients are less active, for instance, during a weekend. Users can get early notification. Downtime due to maintenance is often called scheduled downtime. On the other hand, failures tend to occur when the servers are working under heavy load, i.e. when most clients are connected. Downtime due to failures is often called unscheduled downtime. Sometimes service availability is measured considering only unscheduled downtime.

Vendors often provide figures for system availability. System availability is computed similarly to service availability. The downtime is obtained by multiplying the average number of system failures (OS crashes, HW failures, . . . ) by the average repair time.

Consider a service whose servers are distributed on a set of N (where N>1) nodes in a cluster. For the service to be unavailable, all of the N nodes must fail at the same time. Since most system failures are statistically independent, the probability of such an event is p^N, where p is the probability of a failure of a single system. For example, given a cluster of 2 nodes with availability of 99.7% for each node, at any given time, there is a 0.3% or 0.003 probability that a node is unavailable. The probability of both nodes being unavailable at the same time is 0.003^2 = 0.000009 or 0.0009%. The cluster as a whole therefore has a system availability of 99.9991% or (1 - 0.000009). System availability of a cluster is high enough to allow the deployment of highly available services.
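The two-node arithmetic above generalizes to any N. The sketch below assumes, as the text does, that node failures are statistically independent; the function names are illustrative.

```python
# Illustrative sketch of the p^N calculation for a cluster of N nodes.
# Assumption (from the text): node failures are statistically independent.


def cluster_unavailability(node_availability, n_nodes):
    """Probability that all n_nodes are unavailable at the same time."""
    if n_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    p = 1.0 - node_availability  # probability that a single node is down
    return p ** n_nodes


def cluster_availability(node_availability, n_nodes):
    """System availability of the cluster as a whole."""
    return 1.0 - cluster_unavailability(node_availability, n_nodes)
```

With node_availability = 0.997 and n_nodes = 2, this reproduces the 0.000009 and 99.9991% figures from the example.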

A resource is a single physical or logical entity that provides a service to clients or other resources. For example, a resource can be a single disk volume, a particular network address, or an application such as a web server. A resource is generally available for use over time on two or more nodes in a cluster, although it can be allocated to only one node at any given time.

Resources are identified by a resource name and a resource type. One resource can be dependent on one or more other resources; if so, it will not be able to start (that is, be made available for use) unless the resources it depends on are also started. Dependent resources must be part of the same resource group and are identified in a resource dependency list.
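One way to picture the start-up constraint is as a dependency graph in which every resource is started only after the resources it depends on. The following is a minimal sketch, not the mechanism used by the High Availability base software: the start_order function and the dictionary layout are assumptions for illustration, and the ordering uses the standard-library graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def start_order(dependency_lists):
    """Return an order in which the resources of one group could be started.

    dependency_lists maps each resource name in the group to the list of
    resource names it depends on (a hypothetical in-memory form of the
    resource dependency list).
    """
    group = set(dependency_lists)
    for resource, deps in dependency_lists.items():
        missing = set(deps) - group
        if missing:
            # Dependencies must belong to the same resource group.
            raise ValueError(
                f"{resource} depends on resources outside the group: "
                f"{sorted(missing)}"
            )
    # static_order() yields every resource after the resources it depends on.
    return list(TopologicalSorter(dependency_lists).static_order())
```

For a group in which a web server depends on a file system and an IP address, and the file system depends on a volume, the volume is ordered before the file system and both are ordered before the web server.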

A resource name identifies a specific instance of a resource type. A resource name must be unique for a given resource type.

A resource type is a particular class of resource. All of the resources in a particular resource type can be handled in the same way for the purposes of failover. Every resource is an instance of exactly one resource type.

A resource type is identified by a simple name; this name must be unique within the cluster. A resource type can be defined for a specific node, or it can be defined for an entire cluster. A resource type that is defined for a specific node overrides a cluster-wide resource type definition with the same name; this allows an individual node to override global settings from a cluster-wide resource type definition.

Like resources, a resource type can be dependent on one or more other resource types. If such a dependency exists, at least one instance of each of the dependent resource types must be defined. For example, a resource type named Netscape_web might have resource type dependencies on resource types named IP_address and volume. If a resource named web is defined with the Netscape_web resource type, then the resource group containing web must also contain at least one resource of the type IP_address and one resource of the type volume.
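The Netscape_web example can be expressed as a simple check over a resource group. This is a sketch under stated assumptions: the TYPE_DEPENDENCIES table and the missing_type_dependencies function are hypothetical, and a group is represented as a list of (resource name, resource type) pairs.

```python
# Hypothetical resource type dependency table, following the example above.
TYPE_DEPENDENCIES = {
    "Netscape_web": {"IP_address", "volume"},
}


def missing_type_dependencies(group, type_dependencies=TYPE_DEPENDENCIES):
    """Report, per resource, the required resource types absent from the group.

    group is a list of (resource_name, resource_type) pairs. The result maps
    each offending resource name to a sorted list of missing resource types.
    An empty result means every resource type dependency is satisfied.
    """
    present_types = {rtype for _, rtype in group}
    problems = {}
    for name, rtype in group:
        missing = type_dependencies.get(rtype, set()) - present_types
        if missing:
            problems[name] = sorted(missing)
    return problems
```

A group holding only web and an IP_address resource would be reported as missing a volume; adding a volume resource clears the report.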

In one embodiment, predefined resource types are provided. However, a user can create additional resource types.

A resource group is a collection of interdependent resources. A resource group is identified by a simple name; this name must be unique within a cluster. Table 1 shows an example of the resources for a resource group named WebGroup.

TABLE 1

Resource        Resource Type
Vol1            volume
/fs1            filesystem
199.10.48.22    IP_address
Oracle_DB       Application

In some embodiments, if any individual resource in a resource group becomes unavailable for its intended use, then the entire resource group is considered unavailable. In these embodiments, a resource group is the unit of failover for the High Availability base software.

In some embodiments of the invention, resource groups cannot overlap: that is, two resource groups cannot contain the same resource.

A resource dependency list is a list of resources upon which a resource depends. Each resource instance must have resource dependencies that satisfy its resource type dependencies before it can be added to a resource group.

A resource type dependency list is a list of resource types upon which a resource type depends. For example, the filesystem resource type depends upon the volume resource type, and the Netscape_web resource type depends upon the filesystem and IP_address resource types.

For example, suppose a file system instance /fs1 is mounted on volume /vol1. Before /fs1 can be added to a resource group, /fs1 must be defined to depend on /vol1. The High Availability base software only knows that a file system instance must have one volume instance in its dependency list. This requirement is inferred from the resource type dependency list.

A failover is the process of allocating a resource group (or application) to another node, according to a failover policy. A failover may be triggered by the failure of a resource, a change in the node membership (such as when a node fails or starts), or a manual request by the administrator.

A failover policy is the method used by High Availability base software to determine the destination node of a failover. A failover policy consists of the following:

Failover domain

Failover attributes

Failover script

The administrator can configure a failover policy for each resource group. A failover policy name must be unique within the pool.

A failover domain is the ordered list of nodes on which a given resource group can be allocated. The nodes listed in the failover domain must be within the same cluster. However, the failover domain does not have to include every node in the cluster.

The administrator defines the initial failover domain when creating a failover policy. This list is transformed into a run-time failover domain by the failover script. High Availability base software uses the run-time failover domain along with failover attributes and the node membership to determine the node on which a resource group should reside. High Availability base software stores the run-time failover domain and uses it as input to the next failover script invocation. Depending on the run-time conditions and contents of the failover script, the initial and run-time failover domains may be identical.

In general, High Availability base software allocates a given resource group to the first node listed in the run-time failover domain that is also in the node membership; the point at which this allocation takes place is affected by the failover attributes.
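The general allocation rule stated above, taken on its own and leaving the failover attributes aside, can be sketched in a few lines. The select_target_node function and the node names in the usage note are illustrative only.

```python
def select_target_node(runtime_failover_domain, node_membership):
    """Return the first node of the ordered run-time failover domain that is
    also in the current node membership, or None if no such node exists."""
    members = set(node_membership)
    for node in runtime_failover_domain:  # order of the domain is significant
        if node in members:
            return node
    return None
```

For a run-time failover domain of n1, n2, n3 and a node membership containing only n2 and n3, the sketch selects n2.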

A failover attribute is a string that affects the allocation of a resource group in a cluster. The administrator must specify system attributes (such as Auto_Failback or Controlled_Failback) and can optionally supply site-specific attributes.

A failover script is a shell script that generates a run-time failover domain and returns it to the High Availability base software process. The High Availability base software process applies the failover attributes and then selects the first node in the returned failover domain that is also in the current node membership.

The action scripts are the set of scripts that determine how a resource is started, monitored, and stopped. Typically, there will be a set of action scripts specified for each resource type.

The following is the complete set of action scripts that can be specified for each resource:

- probe, which verifies that the resource is configured on a server
- exclusive, which verifies that the resource is not already running
- start, which starts the resource
- stop, which stops the resource
- monitor, which monitors the resource
- restart, which restarts the resource on the same server after a monitoring failure occurs

Highly Available Services Overview

Highly Available (HA) services can be provided in two ways. First, a multi-server application using built-in or highly available services can directly provide HA services. In the alternative, a single-server application layered on top of multi-server highly available system services can provide equivalent HA services. In other words, a single-server application may depend on a special application which uses the multi-server application discussed above.

FIG. 1A illustrates an exemplary environment for providing HA services. As shown, the environment 10 includes nodes 20, clients 24, and database 26, all communicably coupled by a network 22. Each of nodes 20.1-20.5 is a computer system comprising hardware and software and can provide services to clients 24. Thus nodes 20 can also be referred to as servers. Specifically, processes 26 comprise software that provides services to clients 24. Moreover, each of nodes 20 can be suitably configured to provide high availability services according to the various embodiments of the invention.

FIG. 1B provides further detail of software layers according to an embodiment of the invention that can be run on nodes 20 to support HA services. As illustrated, software running on a node 20 includes cluster infrastructure 12, HA infrastructure 14, HA base software 16, application plug-ins 28, and processes 26.

Cluster infrastructure 12 includes software components for performingthe following:

Node logging

Cluster administration

Node definition

In one embodiment, the cluster software infrastructure includes cluster_admin and cluster_control subsystems. HA infrastructure 14 provides software components to define clusters, resources, and resource types. In one embodiment, the HA infrastructure includes the following:

- Cluster membership daemon. Provides the list of nodes, called node membership, available to the cluster.
- Group membership daemon. Provides group membership and reliable communication services in the presence of failures to HA base software 16 processes.
- Start daemon. Starts HA base software daemons and restarts them on failures.
- System resource manager daemon. Manages resources, resource groups and resource types. Executes action scripts for resources.
- Interface agent daemon. Monitors the local node's network interfaces.

Further details on the cluster infrastructure 12 and HA infrastructure 14 can be found in the cofiled, copending, U.S. patent application Ser. No. 09/811,158 entitled “MAINTAINING MEMBERSHIP IN HIGH AVAILABILITY COMPUTING SYSTEMS”, previously incorporated by reference.

HA base software 16 provides end-to-end monitoring of services and clients in order to determine whether resource load balancing or failover is required. In one embodiment, HA base software 16 is the IRIS FailSafe product available from Silicon Graphics, Inc. In this embodiment, HA base software includes the software required to make the following high-availability services:

IP addresses (the IP_address resource type)

XLV logical volumes (the volume resource type)

XFS file systems (the filesystem resource type)

MAC addresses (the MAC_address resource type)

In one embodiment of the invention, application plug-ins 28 comprise software components that provide an interface to convert applications such as processes 26 into high-availability services. For example, application plug-ins 28 can include database agents. Each database agent monitors all instances of one type of database. In one embodiment, database agents comprise the following:

IRIS FailSafe Oracle

IRIS FailSafe INFORMIX

IRIS FailSafe Netscape Web

IRIS FailSafe Mediabase

Processes 26 include software applications that can be configured to provide services. It is not a requirement that any of processes 26 be intrinsically HA services. Application plug-ins 28, along with HA base software 16, can be used to turn processes 26 into HA services. In order for a plug-in to be used to turn a process 26 into an HA application, it is desirable that the process 26 have the following characteristics:

- The application can be easily restarted and monitored.
- It should be able to recover from failures, as does most client/server software. The failure could be a hardware failure, an operating system failure, or an application failure. If a node crashes and reboots, client/server software should be able to attach again automatically.
- The application must have a start and stop procedure.
- When the resource group fails over, the resources that constitute the resource group are stopped on one node and started on another node, according to the failover script and action scripts.
- The application can be moved from one node to another after failures.
- If the resource has failed, it must still be possible to run the resource stop procedure. In addition, the resource must recover from the failed state when the resource start procedure is executed in another node.
- The application does not depend on knowing the host name. That is, its resources can be configured to work with an IP address.

It should be noted that an application process 26 itself is not modified to make it into a high-availability service.

In addition, node 20 can include a database (not shown). The database can be used to store information including:

Resources

Resource types

Resource groups

Failover policies

Nodes

Clusters

In one embodiment, a cluster administration daemon (cad) maintains identical databases on each node in the cluster.

Method

The previous section described an overview of a system for providing high availability services and failover policies for such services. This section will provide a description of a method 200 for providing failover policies for high availability services. The methods to be performed by the operating environment constitute computer programs made up of computer-executable instructions. Describing the methods by reference to a flowchart enables one skilled in the art to develop such programs including such instructions to carry out the methods on suitable computers (the processor of the computer executing the instructions from computer-readable media). The method illustrated in FIG. 2 is inclusive of the acts required to be taken by an operating environment executing an exemplary embodiment of the invention.

The method 200 begins when a software component, such as HA base software 16 (FIG. 1B), executes a failover script for a resource in response to either a failover event such as a node or process failure, or a load balancing event such as a resource bottleneck or processor load (block 202). The failover script can be programmed in any of a number of languages, including Java, perl, shell (Bourne, C-Shell, Korn shell etc.) or the C programming language. The invention is not limited to a particular programming language. In one embodiment of the invention, the following scripts can be executed:

- probe, which verifies that the resource is configured on a node
- exclusive, which verifies that the resource is not already running
- start, which starts the resource
- stop, which stops the resource
- monitor, which monitors the resource
- restart, which restarts the resource on the same node when a monitoring failure occurs

It should be noted that in some embodiments, the start, stop, and exclusive scripts are required for every resource type. A monitor script may also be required, although it need contain only a return-success function. A restart script may be required if the restart mode is set to 1; however, this script may contain only a return-success function. The probe script is optional.
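The script requirements in the preceding paragraph can be summarized as a small validation routine. This is a sketch under stated assumptions: it treats the monitor script as always required (the text says only that it may be), and the missing_action_scripts function and its restart_mode parameter are hypothetical names introduced for illustration.

```python
# Assumption: monitor is treated as required here; the text says it "may" be.
ALWAYS_REQUIRED = {"start", "stop", "exclusive", "monitor"}


def missing_action_scripts(provided, restart_mode=0):
    """Return the sorted names of required action scripts that are absent.

    provided is the collection of action script names defined for a resource
    type. The restart script is required only when restart_mode is 1, and
    the probe script is always optional.
    """
    required = set(ALWAYS_REQUIRED)
    if restart_mode == 1:
        required.add("restart")
    return sorted(required - set(provided))
```

A resource type that defines start, stop, exclusive, and monitor passes the check with restart mode 0, and is reported as missing restart when the restart mode is 1.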

In some embodiments, there are two types of monitoring that may be accomplished in a monitor script:

- Is the resource present?
- Is the resource responding?

For a client-node resource that follows a protocol, the monitoring script can make a simple request and verify that the proper response is received. For a web node, the monitoring script can request a home page, verify that the connection was made, and ignore the resulting home page. For a database, a simple request such as querying a table can be made.

Next, a system executing the method receives a failover domain as output from the failover script (block 204). The failover script can receive an input domain, apply script logic, and provide an output domain. The output domain is an ordered list of nodes on which a given resource can be allocated.

Next, the system receives failover attributes (block 206). The failover attributes are used by the scripts and by the HA base software to modify the run-time failover domain used for a specific resource group. Based on the failover domain and attributes, the method determines a target node for the resource (block 208). Once a target node has been determined, the system can cause the resource to start on the target node.
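Blocks 202 through 208 can be sketched end to end as follows. This is an illustration under several assumptions, not the claimed implementation: the failover script is modeled as a Python callable rather than an external script, rotate_domain_script is a hypothetical script, and the attribute handling reflects one plausible reading in which Controlled_Failback keeps a resource on its current node while that node remains available and Auto_Failback always prefers the first available node in the domain. The specification does not define these attribute semantics.

```python
def rotate_domain_script(input_domain, failed_node=None):
    """Hypothetical failover script: keeps the order of the input domain but
    moves the failed node, if any, to the end of the output domain."""
    healthy = [n for n in input_domain if n != failed_node]
    failed = [n for n in input_domain if n == failed_node]
    return healthy + failed


def determine_target(failover_script, stored_domain, attributes,
                     node_membership, current_node=None, failed_node=None):
    """Return (target_node, runtime_domain) for one failover decision."""
    # Blocks 202 and 204: run the script and receive its output domain.
    runtime_domain = failover_script(stored_domain, failed_node)
    members = set(node_membership)

    # Block 206: apply the failover attributes (assumed semantics, see above).
    if ("Controlled_Failback" in attributes
            and current_node in members
            and current_node in runtime_domain):
        return current_node, runtime_domain

    # Block 208: first node of the domain that is in the node membership.
    for node in runtime_domain:
        if node in members:
            return node, runtime_domain
    return None, runtime_domain
```

The returned run-time domain can be stored and passed back in as stored_domain on the next invocation, mirroring the way the text describes the run-time failover domain being reused as script input.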

In the above discussion and in the attached appendices, the term “computer” is defined to include any digital or analog data processing unit. Examples include any personal computer, workstation, set top box, mainframe, server, supercomputer, laptop or personal digital assistant capable of embodying the inventions described herein.

Examples of articles comprising computer readable media are floppy disks, hard drives, CD-ROM or DVD media or any other read-write or read-only memory device.

Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiment shown. This application is intended to cover any adaptations or variations of the present invention. Therefore, it is intended that this invention be limited only by the claims and the equivalents thereof.

What is claimed is:
1. A method for providing failover policies, the method comprising: executing instructions stored in memory by a processor, wherein the executed instructions: detect a failover event associated with an initial node, wherein the detected failover event includes identifying a node failure and a resource associated with the initial node, execute a failover script based on the detected failover event, wherein the executed failover script outputs an ordered list of alternative nodes to which a resource associated with the initial node can be reallocated, receive failover attributes associated with the ordered list of alternative nodes, wherein the received failover attributes specifies how a particular resource can be reallocated, and identify a target replacement node from the ordered list of alternative nodes using the received failover attributes, wherein the identified target replacement node is used to reallocate the resource associated with the initial node using the received failover attributes; and re-allocating the resource associated with the initial node onto the identified target replacement node.